SUMMARY

This paper describes a method for adjusting the analysis of occupational/environmental lung cancer risks for 
the effects of cigarette smoking in cohort and case-control studies. The method uses a function that relates an 
individual’s death rate to his age and cigarette smoking history. Two such functions are examined. The first 
depends on total packs of cigarettes smoked and age. The second, based on the multistage theory of 
carcinogenesis, depends on age, age at start of smoking, and subsequent smoking rates. The lung cancer rates 
predicted by these two functions are compared to those observed in cohort studies of male British physicians 
and U.S. veterans, and in a case-control study of non-Hispanic white men in New Mexico. Neither of the 
cohort data sets distinguished the fit of the two functions. The New Mexico data were fit better by the second 
function, though both functions overpredicted death rates among ex-smokers. Each function explained 
substantially more variation in the New Mexico data than did any of several logistic regression models 
involving categorical variables for age and smoking. 
1. INTRODUCTION

According to recent estimates, 1 lung cancer comprises 68 per cent of occupationally induced 
cancer, with approximately 11,000 occupational lung cancer deaths occurring in the U.S. in 1977. 
The number of lung cancers attributable to indoor and outdoor air pollution is less certain. It is 
important to examine temporal patterns of mortality associated with known hazards to determine 
acceptable exposure levels, to evaluate risk in susceptible subgroups, and to predict future claims 
for damages from past exposures. 

Cigarette smoking dominates all environmental agents as a cause of lung cancer. Doll and Peto 1 
estimated that 88 per cent of the 92,000 lung cancer deaths occurring in the U.S. in 1977 were due to
smoking. Therefore the detection of increased risks from other hazards requires control for the 
potential confounding effects of cigarette smoking. Such control is particularly important in cohort 
studies extending over many years, during which large changes may have occurred in occupational 
exposures, cigarette smoking habits, and lung cancer death rates. Accurate modelling of smoking- 
induced risk also is needed to determine if smokers differ from non-smokers in susceptibility to 
other environmental lung carcinogens. 

This paper proposes a method for dealing with smoking in cohort and case-control studies of 
lung cancer. The method involves a function that relates an individual’s lung cancer rate to his or 
her sex, age, and cigarette smoking history. 
Figure 1. (•) Annual U.S. cigarette consumption in cigarettes per capita from 1920-1979: National Economics Division,
Economic Research Service, U.S. Department of Agriculture, as reported by Kristein. 1 (■) U.S. respiratory cancer 
mortality rates, for both sexes combined, from 6th or 7th ICDA codes 162-164 for 1940-1959 and 8th ICDA code 162 for 
1960-1979. Source: Vital and Health Statistics of the U.S., as reported by Kristein. 2 


The rationale for such a function is seen in Figures 1 and 2. Figure 1 shows temporal trends in
cigarette consumption and in lung cancer death rates for both sexes combined in the U.S. from
1920 to 1980. 2 The two curves are nearly parallel. A simple linear regression of death rates against
consumption 20 years prior, shown in Figure 2, indicates that cigarette consumption explains 93
per cent of the temporal variance in lung cancer mortality. These results suggest that temporal
trends in lung cancer death rates depend only on people's ages and smoking histories, independent
of birth years or other cohort effects. These variables also describe much of the spatial and
socioeconomic variation in lung cancer rates, with some exceptions (for example, rates among
non-smoking Chinese females).

Section 2 outlines ways to use such a function when analysing data from cohort or case-control
studies of occupational or environmental lung carcinogens. When applied to data for subjects with
no exposures other than tobacco, the methods also can be used to evaluate the function's adequacy
as a descriptive summary of age- and smoking-induced risk. The reader interested chiefly in the
epidemiology of smoking and lung cancer can skip Section 2 and proceed directly to Section 3. This
section describes two possible forms for the function, and relates their qualitative predictions to the
known features of lung cancer versus age and cigarette smoking in men. Section 4 uses the methods
of Section 2 to evaluate the fit of the two functions to data from cohort studies of lung cancer
among male British physicians 3 and U.S. veterans, 4 and to data from a case-control study of lung
cancer among non-Hispanic white men in New Mexico. 5
Figure 2. Scatter plot of respiratory cancer death rates versus cigarette consumption in cigarettes per capita, from data
shown in Figure 1.

Curve is least-squares regression line given by $y = 0.6746 + 0.0076x$, where $y$ is the respiratory cancer death rate times $10^5$ in
year $t$, and $x$ is cigarettes per capita in year $t - 20$. $R^2 = 0.93$.


Throughout the paper I shall not distinguish between mortality rates and incidence rates for lung 
cancer. Any inaccuracies so introduced are likely to be small, because the median time between 
diagnosis and death is less than one year.

2. APPLICATION TO COHORT AND CASE-CONTROL STUDIES OF LUNG CANCER 

Let $c(t)$ denote a subject's cigarette smoking rate at age $t$, let $z(t)$ denote a vector of covariates
measured on the subject at age $t$, and let $\lambda[t; c(\cdot), z(\cdot)]$ denote
the lung cancer death rate at age $t$ for a subject with smoking and covariate histories $c(s)$ and $z(s)$,
respectively, $0 \le s < t$. We model the death rate as

$$\lambda[t; c(\cdot), z(\cdot)] = g[t; c(\cdot)]\, r[x(t)\beta]. \tag{1}$$

Here $g$ is a specified age- and smoking-specific baseline function such as those introduced in
Section 3, $r(y)$ is a fixed non-negative function of its argument $y$, $x(t)$ is a $p$-dimensional row vector
whose components are functions of $z(s)$, $0 \le s < t$, and possibly of $t$ and $c(s)$, $0 \le s < t$, and $\beta$ is a
$p$-dimensional column vector of unknown parameters to be estimated.

2.1. Application to cohort data 

If $\beta$ is estimated via the grouped covariate methods described by Holford, 6 Laird and Olivier, 7 and
Breslow et al., 8 then the covariate space of $x(t)$ is partitioned into $K$ regions, and $r[x(t)\beta]$ is
assumed constant within each of them. Let $r_k(\beta)$ denote the value assumed by $r$ when $x(t)$ belongs to
the $k$th region, $k = 1, \ldots, K$. Denote by $t_i(u)$ and $c_i(\cdot)$ the age and smoking history of the $i$th study
subject at follow-up time $u$, $i = 1, \ldots, n$. We introduce for each subject and each region a function
$Y_i(u; k)$ which assumes the value 1 if his covariate value $x_i[t_i(u)]$ lies in the $k$th region, and 0
otherwise, $k = 1, \ldots, K$. His contribution to the expected number of deaths for the $k$th region,
due to smoking and age, is then

$$G_i(k) = \int Y_i(u; k)\, g[t_i(u); c_i(\cdot)]\, du.$$

The sum $E_k = \sum_i G_i(k)$ of these contributions over all subjects replaces the usual expected number
of deaths computed from external death rates.
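
As an illustration only, the integral defining the expected numbers can be approximated by stepping through each subject's follow-up. The sketch below assumes constant-rate smokers, a midpoint rule in follow-up time, and the packs function of Section 3 as the baseline; the dictionary keys, the `region` callback and the function names are hypothetical and are not part of the method as published.

```python
def baseline_rate(age, packs, a=1.13e-3):
    """Illustrative baseline: packs function (9); `a` defaults to the British estimate."""
    return 2.01e-12 * (age - 5.0) ** 4.5 * (1.0 + a * packs)


def expected_deaths(subjects, n_regions, step=1.0):
    """Approximate E_k = sum_i G_i(k) with a midpoint rule over follow-up time.

    Each subject is a dict with hypothetical keys:
      entry_age, exit_age : ages at start and end of follow-up
      start_smoking       : age at start of smoking (None for non-smokers)
      packs_per_year      : constant smoking rate
      region              : function age -> region index (the role of Y_i)
    """
    expected = [0.0] * n_regions
    for s in subjects:
        age = s["entry_age"]
        while age < s["exit_age"]:
            width = min(step, s["exit_age"] - age)
            mid = age + width / 2.0
            if s["start_smoking"] is None:
                packs = 0.0
            else:
                # packs smoked by age (mid - 5), the five-year lag used in (9)
                packs = s["packs_per_year"] * max(0.0, mid - 5.0 - s["start_smoking"])
            expected[s["region"](mid)] += baseline_rate(mid, packs) * width
            age += width
    return expected
```

A finer `step` gives a closer approximation when the region indicator or the smoking history changes within a year.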

Under standard assumptions of independent failure times and independent censoring times, the
likelihood of the observed data is proportional to that of $K$ mutually independent Poisson variates
$O_k$, $k = 1, \ldots, K$. 6,7 Here $O_k$ is the number of subjects whose covariates at time of lung cancer
death lie in the $k$th region. Its mean is $\mathrm{E}[O_k] = E_k\, r_k(\beta)$.

The relative risk functions $r_k(\beta)$ can be modelled in a variety of ways, usually with the dimension
$p$ of $\beta$ assumed less than $K$. In the fully parameterized model, $p = K$ and $r_k(\beta)$ is taken as $\exp \beta_k$.
Then the maximum likelihood estimate $\hat\beta$ gives the relative risk in region $k$ as

$$\exp \hat\beta_k = O_k / E_k,$$

the well-known standardized mortality ratio for the region. The hypothesis that $\beta = 0$ can be tested
using likelihood ratio methods. Alternatively, the score test yields the familiar statistic

$$\sum_k (O_k - E_k)^2 / E_k, \tag{2}$$

which has, under the null hypothesis, an asymptotic chi-square distribution on $K$ degrees of
freedom.
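
The two quantities just described are simple to compute once the observed and expected numbers are in hand. A minimal sketch follows; the function name is hypothetical and the numbers in the usage note are invented for illustration.

```python
def smr_and_score(observed, expected):
    """Return the region-specific SMRs O_k / E_k and the score statistic (2)."""
    smr = [o / e for o, e in zip(observed, expected)]
    score = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return smr, score
```

For example, `smr_and_score([10, 20], [8.0, 25.0])` gives SMRs of 1.25 and 0.8 and a score statistic of 1.5, to be referred to a chi-square distribution on 2 degrees of freedom.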

The above methods can be implemented with the software package GLIM-3, 9 using Poisson 
errors, the log link, and the values $\log E_k$ as 'offsets'. In addition, they can be modified to explore
alternatives to the proportional hazards model (1), such as one in which the environmental 
covariates act additively on the smoking- and age-specific baseline death rates. 

A specified baseline function $g$ also can be used with the partial likelihood methods of Cox. 10 To
describe this approach we treat time as age, although other specifications, such as time since start of
follow-up, can be accommodated. Under the standard independence assumptions, $\beta$ can be
estimated from full cohort studies by maximizing the partial likelihood function

$$L(\beta) = \prod_i \left\{ \frac{r[x_i(t_i)\beta]}{\sum_{j \in R_i} \gamma_{ji}\, r[x_j(t_i)\beta]} \right\}^{\delta_i}. \tag{3}$$

Here the product is taken over all subjects in the cohort, $\delta_i$ assumes the value 1 if the $i$th subject dies
with lung cancer during follow-up and 0 otherwise, and $R_i$ is the set of indices of subjects alive at the
age $t_i$ when the $i$th subject dies or is censored. Finally, the $j$th term of the sum is $\gamma_{ji}\, r[x_j(t_i)\beta]$, where the known
constant

$$\gamma_{ji} = g[t_i; c_j(\cdot)] / g[t_i; c_i(\cdot)] \tag{4}$$

represents the smoking-induced death rate at age $t_i$ for the $j$th subject relative to that of the $i$th
subject. The likelihood (3) differs from the usual one 10 only with respect to the known constants $\gamma_{ji}$
of (4), and standard asymptotic arguments apply to the maximum partial likelihood estimates for $\beta$.
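
To make the role of the known baseline ratios concrete, the following sketch evaluates a log partial likelihood of this kind for a scalar covariate with $r(y) = e^y$. It is an illustration under simplifying assumptions: covariates are fixed in time, delayed entry is ignored (a subject is at risk at every age up to his exit age), and the dictionary keys are hypothetical.

```python
import math


def log_partial_likelihood(beta, subjects):
    """Log partial likelihood with a known age- and smoking-specific baseline.

    subjects: list of dicts with hypothetical keys
      exit_age : age at death or censoring
      died     : True if the subject died of lung cancer
      x        : scalar covariate
      g        : callable age -> baseline death rate for this subject
    """
    total = 0.0
    for i in subjects:
        if not i["died"]:
            continue
        t_i = i["exit_age"]
        risk_set = [j for j in subjects if j["exit_age"] >= t_i]
        # g_j(t_i) / g_i(t_i) is the known relative baseline rate at age t_i
        denom = sum(j["g"](t_i) / i["g"](t_i) * math.exp(beta * j["x"])
                    for j in risk_set)
        total += beta * i["x"] - math.log(denom)
    return total
```

When every subject has the same baseline function, the ratios equal 1 and the expression reduces to the usual Cox log partial likelihood.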
2.2. Application to case-control and case-cohort data

By modifying the risk sets in the manner described by Prentice, 11 one can adapt the partial 
likelihood function (3) for use with data from matched case-control and case-cohort studies. I shall 
not describe these methods in detail; rather I shall focus on application to frequency matched case- 
control studies, since this will be needed later. 

Frequency-matched case-control data can be analysed by incorporating the function $g$ into the
binomial likelihood methods described by Prentice and Pyke. 12 Suppose, for example, that within
a fixed time period, cases are sampled from some target population. Controls free of lung cancer are
chosen from the same population to ensure that their relative frequency distribution over certain
age strata equals that of the cases. It is assumed that a subject's probability of developing lung
cancer during the period is small relative to 1. Then from (1) this probability can be modelled
approximately as

$$P[D = 1 \mid t, c(\cdot), z(\cdot)] = \frac{g[t; c(\cdot)]\, r[x(t)\beta]}{1 + g[t; c(\cdot)]\, r[x(t)\beta]}. \tag{5}$$

In (5) $D$ assumes the value 1 if the subject develops lung cancer during the period and 0 otherwise,
and $t$ is his age at the start of the period.

It is also assumed that a subject's probability of being sampled depends on his smoking and
covariate histories only through his age stratum. The arguments of Prentice and Pyke then apply to
show that likelihood-based inferences for $\beta$ can be obtained by fitting the model (5) directly to the
age-stratified case-control data as if they were obtained from a prospective study. Specifically, one
maximizes a likelihood that is a product of components, one for each age stratum. The component
for the $j$th stratum has the generalized logistic form

$$\prod_i \frac{\{\eta_j \gamma_i\, r[x_i(t_i)\beta]\}^{d_i}}{1 + \eta_j \gamma_i\, r[x_i(t_i)\beta]}. \tag{6}$$

Here $\eta_j$ is a parameter that depends on the sampling fractions of cases and controls in the stratum,
$\gamma_i = g[t_i; c_i(\cdot)]$ is the $i$th person's age- and smoking-specific hazard rate, $d_i$ equals 1 if the $i$th person is a case
and 0 otherwise, and $i$ indexes the subjects in the stratum.

When $r$ has the exponential form $r(y) = e^y$, estimation and hypothesis testing for $\beta$ can be
accomplished on GLIM-3 using binomial errors, the logistic link, and the values of $\log \gamma_i$ as
'offsets'.

The hypothesis $\beta = 0$ can be tested by the likelihood ratio statistic or the score statistic. The
latter has a form analogous to (2) when the $p$-dimensional covariate space of $x(t)$ is partitioned into
$K$ regions, with $r[x(t)\beta] = \exp \beta_k$ when $x(t)$ belongs to the $k$th region, $k = 1, \ldots, K$. To describe it,
let $O_{jk}$ and $E_{jk}$ represent, within the $j$th age stratum, the observed and 'expected' numbers of cases
whose covariates at time of interview belong to the $k$th region. Here $E_{jk}$ is defined as

$$E_{jk} = \sum_i P_{ji} z_{ik}, \tag{7}$$

where $P_{ji} = \gamma_i \eta_j / (1 + \gamma_i \eta_j)$, $\eta_j$ satisfies $E_{j\cdot} = O_{j\cdot}$, $z_{ik} = 1$ if the $i$th subject's covariate vector lies in
region $k$ and 0 otherwise, and $i$ indexes subjects in the $j$th age stratum. Finally let

$$V_{jk} = \sum_i P_{ji}(1 - P_{ji}) z_{ik}$$

represent the binomial variance component for stratum $j$ and region $k$. The score statistic can now
be written

$$\sum_k (O_{\cdot k} - E_{\cdot k})^2 / V_{\cdot k}.$$
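
The steps above can be sketched in a few lines. The code below is a minimal illustration, not the GLIM-3 implementation: it finds each stratum parameter by bisection on a logarithmic scale so that fitted and observed numbers of cases agree, then accumulates observed numbers, expected numbers and variances over strata. The data layout and function names are hypothetical, and every region is assumed to contain at least one subject.

```python
def solve_eta(gammas, n_cases, lo=1e-30, hi=1e30, iters=200):
    """Find eta with sum_i gamma_i*eta / (1 + gamma_i*eta) equal to n_cases."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5          # geometric midpoint
        fitted = sum(g * mid / (1.0 + g * mid) for g in gammas)
        if fitted < n_cases:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


def stratified_score(strata, n_regions):
    """Score statistic summed over regions.

    strata: list of strata; each stratum is a list of
            (gamma, is_case, region_index) tuples, one per subject.
    """
    O = [0.0] * n_regions
    E = [0.0] * n_regions
    V = [0.0] * n_regions
    for subjects in strata:
        gammas = [g for g, _, _ in subjects]
        n_cases = sum(1 for _, is_case, _ in subjects if is_case)
        eta = solve_eta(gammas, n_cases)
        for g, is_case, k in subjects:
            p = g * eta / (1.0 + g * eta)
            O[k] += 1.0 if is_case else 0.0
            E[k] += p
            V[k] += p * (1.0 - p)
    return sum((O[k] - E[k]) ** 2 / V[k] for k in range(n_regions))
```

Each stratum must contain at least one case and one control for the bisection to have a solution.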
Figure 3. Lung cancer death rates among U.S. males who never smoked.

Data points represent five-year age-specific rates based on 186 lung cancer deaths among 94,000 U.S. males observed
between 1960-1972. Fitted curve, $g(t) = 2.01 \times 10^{-12}(t - 5)^{4.5}$, where $t$ is age in years, was obtained from least-squares
regression of the log of death rates versus the log of the midpoints of the five-year age intervals, minus five years.



3. CIGARETTE SMOKING AND LUNG CANCER 

Figure 3 shows five-year age-specific mortality rates based on 186 lung cancer deaths observed by 
the American Cancer Society in the period 1960-1972 among 94,000 US male lifelong non- 
smokers. 13 The figure shows that the rates are described well by the power of age function

$$g(t) = 2.01 \times 10^{-12}(t - 5)^{4.50}, \tag{8}$$

where $t$ is age in years. The exponent 4.50 and the proportionality constant $2.01 \times 10^{-12}$ were
obtained by weighted least-squares regression of the logs of observed death rates against the logs of
the midpoints of five-year age intervals, minus five years. The five-year lag was chosen arbitrarily.
Any of several power functions, such as $g(t) = 3.32 \times 10^{-13}\, t^{4.84}$, provides an equally good fit to
these data.

I shall consider two functions that describe individual male lung cancer death rates in terms of
the individual’s age and lifetime smoking history. Both functions reduce to the power of age 
function (8) for lifelong non-smokers. 

3.1. Packs function $g_1$

The first function specifies that the excess death rate due to smoking depends linearly on the
cumulative amount smoked:

$$g_1[t; c(\cdot)] = 2.01 \times 10^{-12}(t - 5)^{4.50}(1 + a\,\mathrm{PKS}). \tag{9}$$

Here $c(\cdot)$ denotes past smoking history at rate $c(s)$, $0 \le s < t$, PKS denotes the total number of
packs of cigarettes smoked by age $t - 5$, and $a$ is a constant to be specified.

For men who smoke at constant rate $c(s) = c$ packs per year from age $t_0$ to age $t_1 \le t - 5$,
$\mathrm{PKS} = c(t_1 - t_0)$, and (9) becomes

$$g_1[t; c(\cdot)] = 2.01 \times 10^{-12}(t - 5)^{4.5}[1 + ac(t_1 - t_0)]. \tag{10}$$

Therefore the excess death rate among continuing smokers is proportional to their smoking rate $c$,
and, for $t_0$ small relative to $t$, roughly to the 5.5th power of their age. Men who smoke two packs a
day for ten years incur the same additional risk as those of the same age who smoke half a pack a
day for forty years. According to (10), excess rates among ex-smokers rise in proportion to total
amount smoked, independently of ages started or stopped, and in proportion to the 4.5th power of
current age. This sharp rise with time since termination of smoking conflicts with qualitative
aspects of some data sets. 14,15 Therefore, a second function $g_2$ was chosen to reflect more
accurately the evolution of risk among ex-smokers. To motivate it, I shall review briefly the
epidemiology of lung cancer in cigarette smokers.
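
The equivalence between heavy short-term and light long-term smoking under the packs function can be checked directly. In this sketch the function name is hypothetical, and the default value of `a` is the estimate obtained from the British data in Section 4.1; both smokers are assumed to have stopped by age $t - 5$.

```python
def g1(age, packs, a=1.13e-3):
    """Packs function (9): death rate at `age` given total packs smoked by age - 5."""
    return 2.01e-12 * (age - 5.0) ** 4.5 * (1.0 + a * packs)


heavy_short = g1(60.0, packs=2.0 * 365 * 10)   # two packs a day for ten years
light_long = g1(60.0, packs=0.5 * 365 * 40)    # half a pack a day for forty years
never = g1(60.0, packs=0.0)                    # lifelong non-smoker, as in (8)

relative_risk = heavy_short / never            # equals 1 + a * 7300
```

Both histories amount to 7,300 packs, so the two rates coincide, and with this value of `a` the rate is a little over nine times that of a non-smoker of the same age.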

Figure 4 shows lung cancer incidence rates versus age in male non-smokers and versus duration 
of cigarette smoking in male British physicians who did not change their smoking rate during more 
than twenty years of mortality observation. These data, taken from Doll 16 and graphed on a 
log-log plot, suggest that rates among regular smokers increase with the same power of smoking 
duration as that describing rates versus age in non-smokers. 

In analysing further follow-up of these data, Doll and Peto 3 found that incidence rates among
men aged $t$ years increase roughly as the 4.5th power of $(t - 22.5)$, times a quadratic function of
smoking rate. If smokers are assumed to start smoking at age 17.5 years, and if rates at age $t$ depend
on smoking histories five years earlier, then $t - 22.5$ represents duration of 'effective' smoking. Thus
the British rates are proportional to the 4.5th power of smoking duration, in agreement with the
relationship (8) between mortality rates and age among non-smokers.

A quadratic dependence of incidence rates on smoking rate is what might be expected if smoking
strongly affected two of the 'stages' in the multistage theory of carcinogenesis. In the simplest
version of this theory that is consistent with the above observations, stem cells in the bronchial
epithelium undergo two or more discrete heritable changes prior to generating a clone of malignant
cells that eventually becomes detectable clinically. The fourth or fifth power of age noted above is
consistent with a theory involving 5 or 6 such changes, or more plausibly, involving fewer changes
but with selective mitotic division of partially transformed cells. 17,19

Figure 4. Lung cancer incidence rates versus duration of cigarette smoking in male British physicians who were regular
smokers, and versus age in lifelong non-smokers. Source: Doll 16


To determine the contributions to the stage transition rates of cigarette smoking relative to other
'background' exposures, one needs data on how lung cancer rates vary with fluctuating smoking
patterns. Such data are sparse. One exception is data from a hospital-based study of 6920 male lung
cancer patients and 13,460 male controls in five Western European countries, 15 analysed by Brown
and Chu. 1 Another is data from a population-based case-control study of 233 non-Hispanic white
male lung cancer patients and 373 non-Hispanic white male controls in New Mexico. 5 Both
analyses used logistic regression models to estimate relative risks in discrete smoking categories.
Brown and Chu assumed that smoking affects the first and the penultimate of several stages. They
used the observed temporal pattern of risk among ex-smokers to deduce that the contribution of
smoking relative to background for the penultimate stage was roughly double that for the first
stage.


3.2. Multistage function $g_2$

The second function was chosen to agree with the relation (8) for death rates among non-smokers,
to agree qualitatively with incidence rates among the British smokers who did not change their
smoking rate, and to agree with Brown and Chu's observations on the evolution of ex-smokers' risk
with time since quitting. It specifies that cigarette smoke strongly affects transition to the first and
the penultimate of at least three stages, and that its effect on transition rates to the penultimate
stage is double that of the first stage. The resulting death rate at age $t$ among males who have
smoked at rate $c(s)$, $0 \le s < t$, is given by

$$g_2[t; c(\cdot)] = 2.01 \times 10^{-12} \Big\{ (t - 5)^{4.5} + 9.00p \int_0^{t-5} c(s)\, s^{3.5}\, ds + 4.50p \int_0^{t-5} c(s)(t - 5 - s)^{3.5}\, ds + 31.5p^2 \int_0^{t-5}\!\!\int_0^{s} c(s)c(u)(s - u)^{2.5}\, du\, ds \Big\}. \tag{11}$$

Here $p$ is a constant to be specified. The derivation of (11) can be found in Whittemore and Keller. 17
Its use requires the following information for each subject who has smoked cigarettes:

(a) age at start of smoking; 

(b) any ages at which smoking rate changed; 

(c) smoking rate during each smoking interval. 


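It will be convenient to evaluate (11) for men who smoke at a constant rate $c$ from age $t_0$ to age
$t_1 \le t - 5$:

$$g_2[t; c(\cdot)] = 2.01 \times 10^{-12} \big\{ (t - 5)^{4.5} + pc(1 + 2pc)(t_1 - t_0)^{4.5} + 2pc(t_1^{4.5} - t_0^{4.5}) + pc[(t - 5 - t_0)^{4.5} - (t - 5 - t_1)^{4.5} - (t_1 - t_0)^{4.5}] \big\}. \tag{12}$$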

According to (12), excess rates among continuing smokers increase with age at start of smoking and
with current age, and increase quadratically with smoking rate. Excess rates also increase with time
since stopping, but at a slower rate than do those given by $g_1$.
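
As a numerical check on the constant-rate form, the sketch below evaluates the multistage function in two ways: by midpoint-rule integration for an arbitrary smoking-rate function, and by a closed form for a constant rate between two ages. Several points are assumptions made here rather than statements from the text: the integrals are taken from birth to age $t - 5$, the kernel of the double integral is $(s - u)^{2.5}$, and the smoking rate is expressed in cigarettes per day. All function names are hypothetical.

```python
SCALE = 2.01e-12
LAG = 5.0


def g2_numeric(age, rate, p, n=600):
    """Multistage function by midpoint-rule integration; rate(s) is the smoking rate at age s."""
    T = age - LAG
    h = T / n
    mids = [(i + 0.5) * h for i in range(n)]
    c = [rate(s) for s in mids]
    first = sum(ci * s ** 3.5 for ci, s in zip(c, mids)) * h
    second = sum(ci * (T - s) ** 3.5 for ci, s in zip(c, mids)) * h
    cross = 0.0
    for i in range(n):
        if c[i] == 0.0:
            continue
        inner = sum(c[j] * (mids[i] - mids[j]) ** 2.5 for j in range(i)) * h
        cross += c[i] * inner * h
    return SCALE * (T ** 4.5 + 9.0 * p * first + 4.5 * p * second
                    + 31.5 * p * p * cross)


def g2_constant(age, c, t0, t1, p):
    """Closed form for constant rate c from age t0 to age t1 <= age - 5."""
    T = age - LAG
    d = t1 - t0
    excess = (p * c * (1.0 + 2.0 * p * c) * d ** 4.5
              + 2.0 * p * c * (t1 ** 4.5 - t0 ** 4.5)
              + p * c * ((T - t0) ** 4.5 - (T - t1) ** 4.5 - d ** 4.5))
    return SCALE * (T ** 4.5 + excess)
```

For an ex-smoker aged 65 who smoked 20 cigarettes a day from age 17.5 to age 45, the two evaluations agree closely, and setting the smoking rate to zero recovers the non-smoker function (8).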


4. GOODNESS-OF-FIT

The score and likelihood ratio statistics of Section 2 will now be used to evaluate goodness-of-fit of
the functions $g_1$ and $g_2$ to cohort and case-control data among men with no known exposures to
lung carcinogens other than tobacco. This is done by choosing the covariates $x(t)$ in (1) to be
indicators for certain categories of smoking and age. The test statistics then evaluate the adequacy
of $g_1$ and $g_2$ as summary descriptions of lung cancer death rates, relative to a more general model
that allows them also to vary within the categories specified by $x(t)$ (for further discussion see
Tsiatis 20 and Whittemore and McMillan 21 ).

4.1. British data 

Table I shows the distribution of 201 observed and predicted lung cancer deaths among those
British physicians whose smoking rates remained constant over the follow-up period. Observed
deaths are from Table 3 of Doll and Peto. 3 To compute the number of deaths predicted by the two
functions, I assumed that all smokers started at age 17.5 years and smoked at a constant rate. Since
individual reported smoking rates were not published, I also assumed that all person-years in an
age-smoking category of Table I were contributed by men of age equal to the midpoint of the age
interval, and who had smoked at the median rate for the category. Thus predicted numbers of
deaths were obtained by multiplying the person-years in an age-smoking category by the rates (10)
and (12), with $c$ given by the median smoking rate for the category, $t$ given by the midpoint of the
age interval, $t_0 = 17.5$ years, and $t_1 = t - 5$. The parameters $a = 1.13 \times 10^{-3}$ and $p = 0.207$ were
determined by equating total observed and expected numbers of deaths.
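
Because the packs function is linear in $a$, equating total observed and expected deaths gives $a$ in closed form. The sketch below illustrates this under the assumptions stated above (smoking from age 17.5 at a constant rate up to five years before the age at risk); the cell layout and function name are hypothetical. The corresponding equation for $p$ in (12) is quadratic in $p$ and would need a numerical root-finder, which is not shown.

```python
def calibrate_a(cells, total_observed, start_age=17.5):
    """Solve sum over cells of PY * g1 = total_observed for the constant a.

    cells: iterable of (person_years, age_midpoint, packs_per_year);
           use packs_per_year = 0 for non-smokers.
    """
    base = 0.0    # expected deaths if a were zero
    slope = 0.0   # increase in expected deaths per unit of a
    for person_years, age, packs_per_year in cells:
        g0 = 2.01e-12 * (age - 5.0) ** 4.5
        packs = packs_per_year * max(0.0, age - 5.0 - start_age)
        base += person_years * g0
        slope += person_years * g0 * packs
    return (total_observed - base) / slope
```

At least one cell must contain smokers, otherwise `slope` is zero and no value of `a` can be determined.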

When $x(t)$ represents indicators for 71 of the $8 \times 9 = 72$ joint age-smoking categories, the
asymptotic null distribution of the score statistic (2) is chi-square on 71 degrees of freedom. Its
values are 50.7 for $g_1$ and 50.8 for $g_2$, indicating acceptable agreement between observed numbers
of deaths and those predicted by both functions. The many small expected values in the table
Table I. Observed (O)* and predicted (E)† numbers of lung cancer deaths and person-years (PY)‡ at risk among male British physicians by age and smoking habit

[Table body not recoverable from the source. Surviving grand totals: O = 201; E = 201.0 under $g_1$ and 201.0 under $g_2$.]

* Source: Doll and Peto. 3
† According to formulae (10) and (12). All smokers were assumed to start smoking at age 17.5 years and to continue at the rate given by the mean for their smoking group. Age at risk was taken as the midpoint of the age interval.
‡ Person-years in hundreds.
compromise the asymptotic distribution of the statistic. Nevertheless, when the score test was
applied to the two margins of Table I, it also failed to distinguish the two functions ($\chi^2 = 4.7$ and 8.9,
respectively, for the marginal age categories, and $\chi^2 = 12.2$ and 5.7 for the smoking margins).

4.2. U.S. Veterans data 

Table II gives the distributions of 1177 observed and predicted lung cancer deaths among
approximately 200,000 non-smoking and current cigarette smoking male U.S. veterans, classified
by age at death and smoking rate at time of interview. These data, taken from Kahn, 4 represent 8.5
years of mortality observation subsequent to interview. Predicted lung cancer deaths were
obtained from (10) and (12) by assuming that smokers smoked at a constant rate. All person-years
in an age-smoking category were assumed to be contributed by men of age and smoking rate equal
to the midpoints of the age and smoking intervals, with the highest smoking interval represented by
45 cigarettes per day. Ages at start of smoking, specific for current age, were determined from the
distributions given for males in Table 3 of Hammond and Garfinkle. 22 For example, 69.8 per cent
of the 116,300 person-years in the age category 35-44 years were assumed to be contributed by men
who started smoking at age 15, 27.2 per cent by men who started smoking at age 25, etc. The
parameter values $a = 0.59 \times 10^{-3}$ and $p = 0.128$ were determined by equating total observed and
expected numbers of deaths. These values are smaller than those found for the British data,
indicating smaller smoking-specific death rates for the U.S. veterans. (When the values obtained
from the British data were applied to the veterans data, $g_1$ and $g_2$ predicted 3244 and 2132 deaths,
respectively, in contrast to the 1177 observed.)

Table II shows that neither $g_1$ nor $g_2$ provides an adequate fit to the U.S. data. The score statistic
obtained by taking $x(t)$ to be a vector of indicators for 24 of the 25 categories equals 61.0 for $g_1$ and
76.9 for $g_2$. When compared to a chi-squared distribution on 24 degrees of freedom, these values
indicate highly significant departures from the observed counts. This lack of fit contrasts with the
good agreement of $g_1$ and $g_2$ with the British data. It may be due to changes in smoking rates by the
veterans, both before and after they reported their current rates at interview.
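
The score statistics quoted above can be recomputed from the cell entries of Table II. In the sketch below the observed and predicted numbers are transcribed from the table as it is read here (a few digits are unclear in the source and were fixed by requiring row and column totals to agree), so the arrays should be treated as an assumption; the function name is hypothetical.

```python
# Rows: ages 35-44, 45-54, 55-64, 65-74, 75-84.
# Columns: non- or occasional smokers, 1-9, 10-20, 21-39, 39+ cigarettes/day.
observed = [
    [0, 0, 2, 4, 0],
    [0, 0, 2, 10, 2],
    [25, 31, 183, 245, 63],
    [49, 44, 239, 194, 50],
    [4, 5, 15, 7, 3],
]
expected_g1 = [
    [0.7, 0.3, 4.4, 5.2, 0.7],
    [0.9, 0.4, 5.1, 7.3, 1.6],
    [31.2, 19.3, 150.9, 189.9, 52.9],
    [53.0, 39.4, 260.8, 241.1, 63.2],
    [5.0, 4.4, 21.8, 13.6, 3.7],
]
expected_g2 = [
    [0.7, 0.1, 5.8, 7.9, 1.3],
    [0.9, 0.4, 5.6, 9.4, 2.4],
    [31.2, 16.6, 141.7, 216.2, 71.5],
    [53.0, 29.8, 215.8, 247.4, 78.4],
    [5.0, 3.0, 16.0, 12.5, 4.1],
]


def score_statistic(obs, exp):
    """Sum of (O - E)^2 / E over all age-smoking cells, as in (2)."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(obs, exp)
               for o, e in zip(obs_row, exp_row))


score_g1 = score_statistic(observed, expected_g1)
score_g2 = score_statistic(observed, expected_g2)
```

With these entries the two statistics come out close to the reported 61.0 and 76.9; small differences are to be expected because the tabulated expected numbers are rounded to one decimal place.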

In analysing the same data, Robins 23 also found that the British physicians experienced higher
smoking-specific death rates than did the U.S. veterans. He reported that the veterans' rates agree
with those of men in Western Europe, as studied by Lubin et al. 15 The excess relative risk of
$0.59 \times 10^{-3}$ per pack found here for the U.S. veterans is similar to the value $0.51 \times 10^{-3}$ found for
U.S. uranium miners, when total packs of cigarettes were cumulated until ten years before age at
risk. 24

4.3. New Mexico data 

As a final test of $g_1$ and $g_2$, their predictions are compared to observations among 233
non-Hispanic white male residents of New Mexico diagnosed with lung cancer during the period
1 January, 1980 to 31 December, 1982, and 373 non-Hispanic white male control residents. These
data, kindly provided by Jonathan Samet, were obtained as part of a case-control study of lung
cancer among both Hispanic and non-Hispanic white men and women in New Mexico. 5

The study included all eligible lung cancer cases aged 25 to 50 years, and a random sample of
those aged 50 through 84 years. Control subjects were chosen from the general population so that
their distribution matched that of the cases within each of the four age intervals: < 55 years; 55-64
years; 65-74 years; and 75-84 years, with an overall ratio of approximately 1.5 controls per case.
Detailed lifetime smoking histories were obtained in personal interviews with subjects or, if
deceased, with their next-of-kin. Therefore use of these data provides a more definitive test of
Table II. Observed (O)* and predicted (E)† numbers of lung cancer deaths and person-years (PY)‡ among male U.S. veterans by age and current smoking habit

Current smoking rate      Age
(cigarettes/day)          (years)        PY        O    E(g1)    E(g2)

Non-smokers or            35-44         352        0      0.7      0.7
occasional smokers        45-54         151        0      0.9      0.9
                          55-64        2139       25     31.2     31.2
                          65-74        1712       49     53.0     53.0
                          75-84          85        4      5.0      5.0
                          Total        4439       78     90.8     90.8

1-9                       35-44          81        0      0.3      0.1
                          45-54          31        0      0.4      0.4
                          55-64         452       31     19.3     16.6
                          65-74         371       44     39.4     29.8
                          75-84          19        5      4.4      3.0
                          Total         954       80     63.8     50.1

10-20                     35-44         600        2      4.4      5.8
                          45-54         164        2      5.1      5.6
                          55-64        1517      183    150.9    141.7
                          65-74        1017      239    260.8    215.8
                          75-84          39       15     21.8     16.0
                          Total        3336      441    443.0    384.9

21-39                     35-44         406        4      5.2      7.9
                          45-54         128       10      7.3      9.4
                          55-64        1030      245    189.9    216.2
                          65-74         500      194    241.1    247.4
                          75-84          13        7     13.6     12.5
                          Total        2078      460    457.1    493.4

39+                       35-44          40        0      0.7      1.3
                          45-54          19        2      1.6      2.4
                          55-64         196       63     52.9     71.5
                          65-74          89       50     63.2     78.4
                          75-84           2        3      3.7      4.1
                          Total         347      118    122.1    157.6

Total                     35-44        1479        6     11.3     15.9
                          45-54         493       14     15.3     18.7
                          55-64        5334      547    444.2    477.2
                          65-74        3689      576    657.5    624.4
                          75-84         157       34     48.5     40.6
                          Total       11154     1177   1177.0   1177.0

* Source: Kahn. 4
† According to formulae (10) and (12). Smokers were assumed to start smoking at 5 cigarettes per day and to increase their rate linearly until their current smoking rate. Distributions of age at start of
smoking, specific for current age, were determined from Table 3 of Hammond and Garfinkle (1961).
‡ Person-years in hundreds.
Table III. Models for smoking data among 233 male lung cancer cases and 373
male controls in New Mexico*

                                   Deviance† by offset           Number of fitted
Variables included              None    log g1    log g2         parameters

age‡                           805.7     692.7     672.0          4
age; PK-YRS§                   696.9     689.0     671.3          8
age; status‖                   709.2     669.6     664.2          6
age; status; status × age      706.2     665.9     660.5         12

* Source: J. M. Samet (personal communication)
† Deviance equals minus twice the maximized log-likelihood
‡ Age categories: < 55; 55-64; 65-74; 75-84 years
§ PK-YR categories: 0; 1-9; 10-19; 20-29; 30+ (one PK-YR equals 365 packs of cigarettes)
‖ Cigarette smoking status: non-smoker, ex-smoker, current smoker


$g_1$ and $g_2$ than does use of the published British and U.S. veterans data, since the latter requires
unverifiable assumptions about subjects' entire smoking histories.

The functions $g_1$ and $g_2$ were fit to the data using the methods described in Section 2 for
frequency-matched case-control data. Analyses were implemented using GLIM-3. Each subject's
'offset' was determined by substituting his reported smoking history in equations (9) and (11). The
parameter values $a = 0.59 \times 10^{-3}$ and $p = 0.128$ obtained from the U.S. veterans data fit the New
Mexico data considerably better than the ones obtained from the British data; therefore I report
only results using these values.

Table III gives deviances (minus twice the maximized log-likelihood values) for the likelihood
function described by (6), with $r[x(t)\beta] = \exp[x(t)\beta]$. The entries in the table correspond to
models specified by different choices for the constants $\gamma_i$ and the covariate vector $x(t)$. Since cases
and controls were frequency matched in the four age strata described above, all models must
contain the corresponding parameters $\eta_1, \ldots, \eta_4$.

The first row of the table gives deviances for models containing only these four age parameters,
together with one of three possible offsets: 0 (i.e. no offset and $\gamma_i = 1$), $\log g_1$, and $\log g_2$. The
deviances for the two models containing $g_1$ and $g_2$ are considerably smaller than for the model
containing only the age variables, indicating that both functions explain more of the variation in
the data. Further, $g_2$ does better than $g_1$.

The remaining rows of the table give deviances for models in which $x(t)$ represents indicators for
various smoking categories. Thus the deviances in the first column are those of standard logistic
regression models containing age and smoking as categorical variables. Comparison of these
deviances with the values 692.7 and 672.0 for the two models in row 1 shows that $g_1$ and $g_2$ explain
much more variation in these data than do the standard models, despite the latter's use of more
fitted parameters.

The second and third columns evaluate the adequacy of g_1 and g_2 relative to more general 
models allowing risk to vary within the specified smoking and age-smoking categories. More 
explicitly, the difference between a deviance in row 1 and that of a lower row has, under the null 
hypothesis that the g function fits the data, an asymptotic chi-square distribution, with degrees of 
freedom equal to the difference in number of fitted parameters. 

Table IV. Likelihood ratio (LR) statistics for fit of PKS function g_1 and multistage function g_2 
to New Mexico lung cancer data* 

                                    LR statistic†
Model                               g_1        g_2        DF

g; age; PK-YRS‡                      3.7        0.7        4
g; age; status‡                     19.4‖       7.1§       2
g; age; status; status × age        26.8‖      11.5        8

* Source: J. M. Samet (personal communication). 
† Twice the difference in maximized log-likelihood between given model and one that includes 
only age and g (see Table III). Under the null hypothesis that the latter model has generated the 
data, LR has an asymptotic chi-square distribution with degrees of freedom equal to the 
difference in number of fitted parameters. 
‡ See footnotes to Table III for definition of variables. 
§ p < 0.05 
‖ p < 0.01 


Table V. Observed (O)* and predicted (E)† numbers of lung cancer cases among non-Hispanic white 
males in New Mexico, 1 January, 1980 to 31 December, 1982 

                                      Smoking status
               Non-smoker           Ex-smoker            Current smoker        Total
                      E                    E                     E
Age (years)    O    g_1    g_2      O    g_1    g_2      O     g_1     g_2     O (= E)

< 55           1    1.5    1.5      2    6.2    5.3      25    20.3    21.2     28
55-64          1    1.2    1.3      9   14.8   12.1      57    51.0    53.6     67
65-74          5    3.0    3.9     22   30.9   25.9      70    63.1    67.2     97
75-84          2    1.6    2.2     12   16.6   14.5      27    22.8    24.2     41

Total          9    7.3    9.0     45   68.4   57.9     179   157.2   166.1    233

* Source: J. M. Samet (personal communication; for details see Pathak et al.). 
† From g_1 of equation (9) and g_2 of (11). 

These likelihood ratio statistics are shown in Table IV. It is evident from the table that according 
to these criteria, g_2 fits the data better than does g_1. Nevertheless, there is indication of poor fit for 
g_2 relative to the richer model that allows additional variation in risk among the non-smoker, 
ex-smoker and current smoker categories (p < 0.05). Table V gives observed and predicted numbers 
of cases in each of the joint age-smoking categories. The predicted numbers are the E_jk of (7). 
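
The significance levels attached to the Table IV statistics can be checked directly from the chi-square reference distribution. The following Python lines are a minimal sketch, not part of the original GLIM-3 analysis. The function name chi2_sf_even_df and the dictionary layout are illustrative assumptions, and the closed-form tail probability used here applies only to even degrees of freedom, which happens to cover all three rows of Table IV.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the tail has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    (Hypothetical helper; odd df would need the incomplete gamma function.)
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0   # current term (x/2)^i / i!, starting at i = 0
    total = 1.0  # running sum of the terms
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# (model, LR statistic for g_1, LR statistic for g_2, degrees of freedom),
# copied from Table IV
table_iv = [
    ("age; PK-YRS", 3.7, 0.7, 4),
    ("age; status", 19.4, 7.1, 2),
    ("age; status; status x age", 26.8, 11.5, 8),
]

p_values = {}
for model, lr_g1, lr_g2, df in table_iv:
    p_values[model] = (chi2_sf_even_df(lr_g1, df), chi2_sf_even_df(lr_g2, df))
    print(f"{model:28s} df={df}  p(g_1)={p_values[model][0]:.4f}  "
          f"p(g_2)={p_values[model][1]:.4f}")
```

For example, the g_2 statistic of 7.1 on 2 degrees of freedom has tail probability exp(-3.55), about 0.029, which agrees with the p < 0.05 reported for the model adding smoking status.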

Table V shows that g_2 predicts too many cases among the ex-smokers, and, correspondingly, too 
few among the current smokers. This imbalance is even worse for g_1. (Within each age group, total 
observed and predicted numbers are equal because of the way the age parameters are estimated.) 
Models containing categorical variables for cigar and pipe smoking were examined, but none 
was found to provide significant improvement over those involving only g_1 or g_2. 
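
The direction and size of this imbalance can be summarized by observed-to-predicted ratios computed from the column totals of Table V. The sketch below is illustrative only: the variable and function names are assumptions, and the ratios are a descriptive summary rather than a test used in the paper.

```python
# Column totals from Table V: observed cases (O) and cases predicted
# by the packs function g_1 and the multistage function g_2.
totals = {
    "non-smoker":     {"O": 9,   "E_g1": 7.3,   "E_g2": 9.0},
    "ex-smoker":      {"O": 45,  "E_g1": 68.4,  "E_g2": 57.9},
    "current smoker": {"O": 179, "E_g1": 157.2, "E_g2": 166.1},
}


def oe_ratios(rows):
    """Return observed/predicted ratios per smoking status for g_1 and g_2."""
    return {
        status: {"g1": r["O"] / r["E_g1"], "g2": r["O"] / r["E_g2"]}
        for status, r in rows.items()
    }


ratios = oe_ratios(totals)
for status, r in ratios.items():
    # A ratio below 1 means the function predicts too many cases.
    print(f"{status:15s} O/E(g_1)={r['g1']:.2f}  O/E(g_2)={r['g2']:.2f}")
```

A ratio below 1 among ex-smokers for both functions, and further below 1 for g_1 than for g_2, corresponds to the overprediction described above.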


5. DISCUSSION 

I have considered two descriptions of male lung cancer death rates as functions of age and previous 
cigarette smoking history. Their predictions agree well with the age dependence of rates among 
male non-smokers, and with the smoking duration and smoking rate dependence of rates among 
male British physicians. The British data give higher smoking-specific death rates than do cohort 
and case-control data from the U.S. Therefore no one function can adequately describe all the 
death rates observed in existing studies. 

Both functions provided poor fit to lung cancer data among U.S. veterans, even when allowance 
was made for the veterans’ reduced susceptibility to smoking, relative to that of the British 
physicians. The poor fit may be due to inappropriate assumptions about subjects’ smoking 
histories. 

When fit to more detailed lifetime smoking data among men in New Mexico with and without 
lung cancer, both functions accounted for substantially more variation than did standard 
categorical regression models. Some of this improvement may be due to the more precise control 
for the effects of age afforded by the functions. Nevertheless, the similar deviances (696.9 versus 
692.7) for the age-PK-YRS and age-g_1 models in Table IV suggest that this explanation cannot 
account for all of the improvement. 

The New Mexico data were more compatible with the multistage function g_2 than with the 
simpler packs function g_1. However, both functions overpredicted death rates among ex-smokers, 
suggesting that smoking may have an even stronger relative effect on later transitions in the 
carcinogenic process than assumed here. Further analysis of data among ex-smokers is needed to 
clarify this issue. 

Potential advantages of a function such as g_2 include more precise control for cigarette smoking 
in studies of other exposures, and more accurate assessment of interactions between smoking and 
such exposures. There are also limitations to its use. One is its exclusion of factors such as tar and 
nicotine content, depth of inhalation, and histologic type of lung cancer. Another is its need for the 
timing and intensity of subjects’ past smoking habits, which may be reported inaccurately or be 
unavailable. It is noteworthy, however, that even though smoking histories were reported by 
next-of-kin for some subjects in the New Mexico study, g_2 explained the data considerably better than 
did categorical logistic regression models. 

Despite the above limitations, the function g_2 has potential utility in monitoring new lung 
carcinogens, and in examining death rates among men exposed to known carcinogens. Further 
work is needed to refine g_2 so that it more accurately reflects death rates among male ex-smokers, 
and to develop and validate a quantitative description of the age and smoking dependence of lung 
cancer death rates among women. 